Incident Report: Stuck Event Collector on Mantle Staging Environment

Date: 2023-12-21
Time: 04:36 (GMT+2)
Duration: Approximately 2 days

Description

A stuck event collector was detected on Mantle. For an extended period it was unable to retrieve some transaction or block data, which caused delays and repeated alerts.

Root Cause

The issue stemmed from changes pushed to the event collectors on staging that affected Mantle. In addition, the Reblok node had become stuck, which further aggravated the issue.

Impact

Event collection for Mantle was unavailable for approximately 2 days. This led to delayed or incorrect data reporting and to alerts being triggered repeatedly.

Timeline

  • 04:36 - Issue first noticed.
  • 11:22 - Issue identified as a stuck event collector.
  • 11:25 - Initial change pushed by Aaron identified as the potential cause.
  • 12:17 (on a later day) - Revert of the changes initiated.
  • 12:17 - Separate issue of an unreachable database identified and resolved.
  • 12:34 - Mantle syncing resumed and the collector was operational again.

Lessons Learned

Regular checks and validations of changes in the staging environment are crucial to prevent prolonged issues. Immediate action and communication when issues are detected can minimize impact.
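
As an illustration of the kind of regular check described above, the following is a minimal sketch of a stall detector for a collector. It is not the team's actual monitoring code. It assumes the collector exposes the number of the last block it processed, and the names StallDetector, stall_after_seconds, and observe are hypothetical. It reports a stall once, instead of repeating the alert, and reports recovery when the block number advances again.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallDetector:
    """Flags a collector whose last processed block stops advancing.

    All names and the threshold are hypothetical, for illustration only.
    """
    stall_after_seconds: float
    last_block: Optional[int] = None
    last_progress_at: float = 0.0
    alerted: bool = False

    def observe(self, block: int, now: float) -> Optional[str]:
        """Record the latest processed block.

        Returns "stalled" once per stall, "recovered" when progress
        resumes after a stall, and None otherwise.
        """
        if self.last_block is None or block > self.last_block:
            # Progress was made: reset the timer and clear any active alert.
            recovered = self.alerted
            self.last_block = block
            self.last_progress_at = now
            self.alerted = False
            return "recovered" if recovered else None
        stalled_for = now - self.last_progress_at
        if not self.alerted and stalled_for >= self.stall_after_seconds:
            # Alert only once so a long stall does not produce repeated alerts.
            self.alerted = True
            return "stalled"
        return None
```

In use, a scheduled job would call observe with the collector's latest block number and the current time (for example from time.time()), and forward any non-None result to the alerting channel.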

Actions Taken

  1. Pushed a change to the event collectors on staging, which affected Mantle.
  2. Identified and reverted the changes that caused the stuck event collector.
  3. Deployed the previous code to resolve the immediate error messages.
  4. Addressed the new issue of an unreachable database.
  5. Monitored until Mantle resumed normal operation and was fully synced.

Incident Reviewer(s)

  • Aaron (Resolved staging collector issue)
  • Jiri Herzan (Identified event collector issue and followed up)
  • Andrew Prasaath (Reported the initial and ongoing issue)